Project Description¶
This project explores the causes of flight delays and cancellations through data analysis. We will examine aviation data to identify the key factors and airlines most affected by disruptions in the Pacific Northwest by applying data wrangling, data visualization, and exploratory data analysis techniques.
A prominent airline company in the Pacific Northwest has collected extensive data on flights and weather conditions. They seek insights into what influences departure delays and cancellations, which will help improve both operational efficiency and passenger experience. As data analysts, we will carry out a comprehensive analysis to support their goals.
The aviation industry is complex, with various factors impacting flight operations. To ensure our findings are relevant and actionable, we will focus on data from the pnwflights2022
dataset provided by the ModernDive team. This dataset includes flight records from the first half of 2022 for two major airports in the region: Seattle-Tacoma International Airport (SEA) and Portland International Airport (PDX).
The data comes in two CSV files:
flights2022.csv¶
This file contains information about each flight, including:
Variable | Description |
---|---|
dep_time |
Departure time (in hhmm format). A missing value indicates a cancellation |
dep_delay |
Departure delay in minutes (negative values indicate early departures) |
origin |
Origin airport code (IATA format) |
airline |
Name of the airline or carrier |
dest |
Destination airport code (IATA format) |
flights_weather2022.csv¶
This file includes the same flight data as above, along with weather conditions at the time of departure:
Variable | Description |
---|---|
visib |
Visibility in miles |
wind_gust |
Wind gust speed in miles per hour |
Objective¶
Our goal is to uncover patterns and insights related to flight delays and cancellations in the Pacific Northwest. We will analyze which variables, including weather conditions and airline-specific patterns, most significantly affect flight performance.
Which airlines and routes are most affected by flight delays, and what impact does wind have on departure delays?¶
# Load required libraries
import pandas as pd
import matplotlib.pyplot as plt
# load the dataset
flights2022 = flights2022 = pd.read_csv("flights2022.csv")
flights_weather2022 = pd.read_csv("flights_weather2022.csv")
# Create route column
flights2022["route"] = flights2022["origin"] + "-" + flights2022["dest"]
flights2022.head()
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | ... | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | airline | route | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | 1 | 1.0 | 2359 | 2.0 | 604.0 | 618 | -14.0 | UA | ... | N405UA | SEA | IAH | 221.0 | 1874 | 23 | 59 | 2022-01-01T23:00:00Z | United Air Lines Inc. | SEA-IAH |
1 | 2022 | 1 | 1 | 1.0 | 2250 | 71.0 | 242.0 | 142 | 60.0 | AS | ... | N265AK | SEA | FAI | 193.0 | 1533 | 22 | 50 | 2022-01-01T22:00:00Z | Alaska Airlines Inc. | SEA-FAI |
2 | 2022 | 1 | 1 | 10.0 | 2355 | 15.0 | 759.0 | 730 | 29.0 | AS | ... | N274AK | SEA | ATL | 261.0 | 2182 | 23 | 55 | 2022-01-01T23:00:00Z | Alaska Airlines Inc. | SEA-ATL |
3 | 2022 | 1 | 1 | 25.0 | 2350 | 35.0 | 606.0 | 550 | 16.0 | AS | ... | N281AK | SEA | ORD | 193.0 | 1721 | 23 | 50 | 2022-01-01T23:00:00Z | Alaska Airlines Inc. | SEA-ORD |
4 | 2022 | 1 | 1 | 35.0 | 2349 | 46.0 | 616.0 | 545 | 31.0 | UA | ... | N426UA | PDX | ORD | 196.0 | 1739 | 23 | 49 | 2022-01-01T23:00:00Z | United Air Lines Inc. | PDX-ORD |
5 rows × 21 columns
flights_weather2022.head()
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | ... | route | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | 1 | 1.0 | 2359 | 2.0 | 604.0 | 618 | -14.0 | UA | ... | SEA-IAH | 33.0 | 23.0 | 66.06 | 160.0 | 8.05546 | 9.270062 | 0.0 | 1022.9 | 10.0 |
1 | 2022 | 1 | 1 | 1.0 | 2250 | 71.0 | 242.0 | 142 | 60.0 | AS | ... | SEA-FAI | 32.0 | 23.0 | 69.04 | 170.0 | 9.20624 | 10.594357 | 0.0 | 1023.4 | 10.0 |
2 | 2022 | 1 | 1 | 10.0 | 2355 | 15.0 | 759.0 | 730 | 29.0 | AS | ... | SEA-ATL | 33.0 | 23.0 | 66.06 | 160.0 | 8.05546 | 9.270062 | 0.0 | 1022.9 | 10.0 |
3 | 2022 | 1 | 1 | 25.0 | 2350 | 35.0 | 606.0 | 550 | 16.0 | AS | ... | SEA-ORD | 33.0 | 23.0 | 66.06 | 160.0 | 8.05546 | 9.270062 | 0.0 | 1022.9 | 10.0 |
4 | 2022 | 1 | 1 | 35.0 | 2349 | 46.0 | 616.0 | 545 | 31.0 | UA | ... | PDX-ORD | 33.0 | 19.0 | 55.75 | 120.0 | 6.90468 | 7.945768 | 0.0 | 1025.1 | 10.0 |
5 rows × 29 columns
Visualize to show the top 9 highest number of cancellations by route and the top 9 highest average departure delays by airline.¶
Analysis by route¶
# Calculate mean departure delay and number of canceled flights for each unique flight route
routes_delays_cancels = flights2022.groupby("route").agg(
mean_dep_delay=("dep_delay", "mean"),
total_cancellations=("dep_time", lambda x: x.isna().sum())
).reset_index()
# Identify routes with the highest mean departure delays
top_routes_by_delay = routes_delays_cancels.sort_values("mean_dep_delay", ascending=False).head(9)
top_routes_by_delay.head()
route | mean_dep_delay | total_cancellations | |
---|---|---|---|
13 | PDX-DSM | 35.782609 | 0 |
19 | PDX-GRR | 35.739130 | 2 |
17 | PDX-FLL | 30.893617 | 3 |
109 | SEA-MIA | 29.916667 | 4 |
71 | SEA-CLT | 27.313199 | 14 |
# Identify routes with the highest number of cancellations
top_routes_by_cancellations = routes_delays_cancels.sort_values("total_cancellations", ascending=False).head(9)
top_routes_by_cancellations.head()
route | mean_dep_delay | total_cancellations | |
---|---|---|---|
103 | SEA-LAX | 7.195969 | 93 |
58 | SEA-ANC | 9.316991 | 78 |
137 | SEA-SFO | 7.054403 | 73 |
102 | SEA-LAS | 8.943503 | 73 |
121 | SEA-PDX | 7.730982 | 71 |
# Create a bar graph for the highest number of cancellations by route
top9_route_cancels_bar, ax = plt.subplots()
ax.bar(top_routes_by_cancellations["route"], top_routes_by_cancellations["total_cancellations"])
ax.set_xlabel("Route")
ax.set_ylabel("Total Cancellations")
ax.set_title("Routes with Highest Number of Cancellations")
ax.set_xticklabels(top_routes_by_cancellations["route"], rotation=90)
plt.show()
plt.close()
C:\Users\newbe\AppData\Local\Temp\ipykernel_2412\3812851333.py:7: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator. ax.set_xticklabels(top_routes_by_cancellations["route"], rotation=90)
Analysis by airline¶
# Find mean departure delays by airline
airlines_delays_cancels = flights2022.groupby("airline").agg(
mean_dep_delay=("dep_delay", "mean"),
total_cancellations=("dep_time", lambda x: x.isna().sum())
).reset_index()
# Identify airlines with the highest mean departure delay
top_airlines_by_delay = airlines_delays_cancels.sort_values("mean_dep_delay", ascending=False).head(9)
top_airlines_by_delay.head()
airline | mean_dep_delay | total_cancellations | |
---|---|---|---|
7 | JetBlue Airways | 42.141524 | 33 |
1 | Allegiant Air | 26.608696 | 6 |
2 | American Airlines Inc. | 20.129903 | 85 |
4 | Frontier Airlines Inc. | 14.355856 | 12 |
9 | Southwest Airlines Co. | 11.407163 | 100 |
# Identify airlines with the highest number of cancellations by airline
top_airlines_by_cancellations = airlines_delays_cancels.sort_values("total_cancellations", ascending=False).head(9)
top_airlines_by_cancellations.head()
airline | mean_dep_delay | total_cancellations | |
---|---|---|---|
0 | Alaska Airlines Inc. | 6.739282 | 1300 |
3 | Delta Air Lines Inc. | 10.648634 | 383 |
6 | Horizon Air | 6.270965 | 238 |
8 | SkyWest Airlines Inc. | 4.596561 | 165 |
9 | Southwest Airlines Co. | 11.407163 | 100 |
# Create a bar graph for highest mean departure delay by airline
top9_airline_delays_bar, ax = plt.subplots()
ax.bar(top_airlines_by_delay["airline"], top_airlines_by_delay["mean_dep_delay"])
ax.set_xlabel("Airline")
ax.set_ylabel("Mean Departure Delay")
ax.set_title("Airlines with Highest Mean Departure Delays")
ax.set_xticklabels(top_airlines_by_delay["airline"], rotation=90)
plt.show()
C:\Users\newbe\AppData\Local\Temp\ipykernel_2412\1878166829.py:7: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator. ax.set_xticklabels(top_airlines_by_delay["airline"], rotation=90)
Determine if 10-mile-per-hour wind gusts or more have a larger average departure delay for both of SEA and PDX.¶
# Are departure delays impacted by 10+ mph winds from each airport
flights_weather2022["group"] = flights_weather2022["wind_gust"].apply(lambda x: ">= 10mph" if x >= 10 else "< 10 mph")
wind_grouped_data = flights_weather2022.groupby(["group", "origin"]).agg(
mean_dep_delay=("dep_delay", "mean")
)
print(wind_grouped_data)
mean_dep_delay group origin < 10 mph PDX 6.686966 SEA 7.857595 >= 10mph PDX 9.147024 SEA 9.231060